With the exponential growth of supercomputers in parallelism, applications are growing more diverse, including traditional large-scale HPC MPI jobs and ensemble workloads such as finer-grained many-task computing (MTC) applications. Delivering high throughput and low latency for both workloads requires a distributed job management system that is orders of magnitude more scalable than today’s centralized ones. In this paper, we present a distributed job launch prototype, SLURM++, which comprises multiple controllers, each managing a partition of SLURM daemons, while ZHT (a distributed key-value store) stores the job and resource metadata. We compared SLURM++ with SLURM using micro-benchmarks of different job sizes ...
Exascale computers will enable the unraveling of significant scientific mysteries. Predictions are t...
This paper was submitted to Euro-Par 2010 on 2010-02-12. In the age of Grid, Cloud, volunteer computin...
Two production clusters co-exist at the Institute of High Energy Physics (IHEP). One is ...
Abstract. With the exponential growth of distributed computing systems in both flops and cores, s...
Distributed systems are growing exponentially in computing capacity. On the high-performance com...
Abstract. The Resource and Job Management System (RJMS) is the middleware in charge of delivering c...
Scheduling large numbers of jobs/tasks over large-scale distributed systems plays a significant role t...
Abstract. Task scheduling and execution over large-scale distributed systems play an important ro...
Clusters of workstations have emerged as an important platform for building cost-effective, scalable...
In job scheduling, the concept of malleability has been explored for many years. Research show...
The problems of scheduling a single parallel job across a large-scale distributed system are well k...
Peer reviewed. High Performance Computing (HPC) is nowadays a strategic asset required to sustain the ...
Abstract. Recent success in building petascale computing systems poses new challenges in job schedul...
SLURM is a popular resource management system that is used on many supercomputers in the TOP500 list...
In this paper we introduce a methodology for dynamic job reconfiguration driven by the programming m...